Introduction

This dataset contains the spirits purchase information of Iowa Class “E” liquor licensees by product and date of purchase from January 1, 2012 to 2017. The dataset can be used to analyze total spirits sales in Iowa of individual products at the store level. Link is here

The content that we will cover in this analysis:

  • What kinds of liquor and how much in sales is being sold in different days of the week, on weekends and various seasons
  • Sales per store name
  • Sales per category name
  • Best buys in categories
  • Most favourite bottle size
  • No. of stores per city / county
  • Dataset with missing categories
  • Sales per city
  • Category per city
  • How people buy in quantities (something more bottles at the same time other only per bottle)
  • Vendor analysis (who’s selling what, who’s the cheapest, etc.)

Note: The original data is cut from big dataset from Kaggle

In [ ]:
#need to run this code if you need to install cufflinks
!pip install cufflinks
In [ ]:
#need to run this command if you need to install chart_studio
!pip install chart_studio

Import the working libraries

In [40]:
import pandas as pd
import dask.dataframe as dd
from dask import compute
import dask
import plotly.graph_objects as go
import chart_studio.plotly as py
import cufflinks as cf
from plotly.offline import iplot, init_notebook_mode
from plotly.offline import iplot
from matplotlib import pyplot as plt
import plotly.express as px
import numpy as np
import csv    
In [ ]:
df_full = dd.read_csv("liquor.trimmed.csv",engine="python",
            error_bad_lines=False,dtype={
       'Bottles Sold': 'float64',
       'County Number': 'object',
       'Item Number': 'object',
       'Pack': 'float64',
       'Store Number': 'object',
       'Unnamed: 0': 'object',
       'Vendor Number': 'object',
        'Bottle Volume (ml)': 'float64', 
            'Zip Code': 'object'})
In [5]:
df_full.compute().shape
Out[5]:
(2000942, 24)

There are 2000942 rows and 24 columns in total

In [ ]:
df_pandas = df_full.compute()
In [7]:
df_dask = dd.from_pandas(df_pandas, npartitions = 8)
In [8]:
df_dask.columns
Out[8]:
Index(['Invoice/Item Number', 'Date', 'Store Number', 'Store Name', 'Address',
       'City', 'Zip Code', 'Store Location', 'County Number', 'County',
       'Category', 'Category Name', 'Vendor Number', 'Vendor Name',
       'Item Number', 'Item Description', 'Pack', 'Bottle Volume (ml)',
       'State Bottle Cost', 'State Bottle Retail', 'Bottles Sold',
       'Sale (Dollars)', 'Volume Sold (Liters)', 'Volume Sold (Gallons)'],
      dtype='object')

Clean the data

In [9]:
df_dask.isnull().sum().compute()
Out[9]:
Invoice/Item Number         0
Date                       32
Store Number               32
Store Name                 32
Address                  1443
City                     1442
Zip Code                 1442
Store Location           1442
County Number            1442
County                   1442
Category                 5931
Category Name            5931
Vendor Number              33
Vendor Name                33
Item Number                32
Item Description           32
Pack                       32
Bottle Volume (ml)         32
State Bottle Cost          36
State Bottle Retail        36
Bottles Sold               36
Sale (Dollars)             36
Volume Sold (Liters)       36
Volume Sold (Gallons)      36
dtype: int64
In [10]:
df_dask = df_dask.dropna()
In [11]:
df_dask.compute().shape
Out[11]:
(1993599, 24)
In [12]:
df_dask['State Bottle Cost'] = df_dask['State Bottle Cost'].str.replace('$', '')
df_dask['State Bottle Retail'] = df_dask['State Bottle Retail'].str.replace('$', '')
df_dask['Sale (Dollars)'] = df_dask['Sale (Dollars)'].str.replace('$', '')
In [13]:
df_dask = df_dask.astype({'Store Number': 'int64', 'County Number': 'int64',
                          'Vendor Number': 'int64', 'Pack': 'int64', 'Bottles Sold': 'int64',
                          'State Bottle Cost': 'float64', 'State Bottle Retail': 'float64',
                         'Sale (Dollars)': 'float64' })
In [14]:
df_dask['Actual Sales(Dollar)'] = df_dask['State Bottle Retail']*df_dask['Bottles Sold']
In [15]:
df_dask.head(5)
Out[15]:
Invoice/Item Number Date Store Number Store Name Address City Zip Code Store Location County Number County ... Item Description Pack Bottle Volume (ml) State Bottle Cost State Bottle Retail Bottles Sold Sale (Dollars) Volume Sold (Liters) Volume Sold (Gallons) Actual Sales(Dollar)
0 INV-02190200029 12/15/2016 2593 Hy-Vee Food Store / Carroll 905 US Highway 30 West Carroll 51401 905 US Highway 30 West\nCarroll 51401\n(42.070... 14 CARROLL ... Kessler Blend Whiskey 6 1750.0 11.01 16.52 2 99.12 3.50 0.92 33.04
0 INV-02834700007 01/20/2017 4073 Uptown Liquor, Llc 306 Hwy 69 South Forest City 50436 306 Hwy 69 South\nForest City 50436\n(43.26153... 95 WINNEBAGO ... Old Crow 6 1750.0 10.49 15.74 1 94.44 1.75 0.46 15.74
0 INV-07032400009 09/01/2017 4762 Huber's Store Inc. 101 Main St Fort Atkinson 52144 101 Main St\nFort Atkinson 52144\n(43.143391, ... 96 WINNESHIEK ... Paramount Blackberry Brandy 6 1750.0 11.29 16.94 1 101.64 1.75 0.46 16.94
0 INV-07866600008 10/10/2017 3926 Liquor Downtown / Iowa City 315 S Gilbert St Iowa City 52240 315 S Gilbert St\nIowa City 52240\n(41.657654,... 52 JOHNSON ... Paramount White Rum 12 750.0 4.00 6.00 1 72.00 0.75 0.19 6.00
0 INV-04285200002 04/11/2017 4482 Indy 66 West #929 / Indianola 2001 W 2nd Ave Indianola 50125 2001 W 2nd Ave\nIndianola 50125\n(41.356153, -... 91 WARREN ... Captain Morgan Original Spiced 12 750.0 9.06 13.59 1 163.08 0.75 0.19 13.59

5 rows × 25 columns

Beginning of Nhu's part

The following part is analysed by Nhu Nguyen

Data Analysis

Top drinking cities and counties

In [16]:
city_volume = df_dask.groupby('City')['Volume Sold (Liters)'].sum().compute().sort_values(ascending = False)
In [17]:
top_volume = city_volume.head(20)
In [18]:
fig = px.bar(top_volume, title = 'Top 20 liquor consumption cities in Iowa state',
            color_discrete_sequence = ['forestgreen'])
fig.update_layout(yaxis_title = 'Volume Sold (Liters)', xaxis_title = 'Iowa State Cities')
fig.show()
In [19]:
county_volume = df_dask.groupby('County')['Volume Sold (Liters)'].sum().compute().sort_values(ascending = False)
In [20]:
county_volume = county_volume.head(20)
In [21]:
fig = px.bar(county_volume, title = 'Top 20 liquor consumption counties in Iowa state')
fig.update_layout(yaxis_title = 'Volume Sold (Liters)', xaxis_title = 'Iowa State Counties')
fig.show()

Polk is the most populous county of Iowa state when it's hosting over 14% of the state's presidents. The liquor consumption of this county is kind double the second one (600k in comparison to 300k).

Top five drinks

In [22]:
item_dcpt = df_dask.groupby(['City', 'Item Description']).sum().compute()
In [23]:
item_dcpt.reset_index(inplace = True)
In [24]:
item_sold_dm= item_dcpt[item_dcpt['City'] == 'Des Moines'].sort_values(by = 'Volume Sold (Liters)',ascending = False)
item_sold_cr= item_dcpt[item_dcpt['City'] == 'Cedar Rapids'].sort_values(by = 'Volume Sold (Liters)',ascending = False)
item_sold_d= item_dcpt[item_dcpt['City'] == 'Davenport'].sort_values(by = 'Volume Sold (Liters)',ascending = False)
item_sold_wdm= item_dcpt[item_dcpt['City'] == 'West Des Moines'].sort_values(by = 'Volume Sold (Liters)',ascending = False)
item_sold_cb= item_dcpt[item_dcpt['City'] == 'Council Bluffs'].sort_values(by = 'Volume Sold (Liters)',ascending = False)
In [25]:
item_sold = pd.concat([item_sold_dm.head(5), item_sold_cr.head(5),item_sold_d.head(5),
                      item_sold_wdm.head(5),item_sold_cb.head(5)])
In [26]:
px.bar(item_sold, x = 'City', y = 'Volume Sold (Liters)', color = 'Item Description')

There are 11 popular liquors in 5 cities:

  • Fireball Cinnamon Whiskey Mini Dispenser is the most favorite liquor when its consumption is especially high.
  • Black Velvet is the second most favorite
  • Hawkeye Vodka, Captain Morgan Spiced Rum are in top favorite

Actual sales of stores

In [27]:
store_locate = df_dask.groupby(['Store Name','Store Location']).sum().compute()
In [28]:
store_locate.reset_index(inplace = True)
In [29]:
store_locate_splt = store_locate['Store Location'].str.split('\\n', n = 3, expand = True)
In [30]:
store_locate['Longitute']= store_locate_splt[2].str.split(',', expand = True)[0].str.replace('(', '')
In [31]:
store_locate['Latitute'] = store_locate_splt[2].str.split(',', expand = True)[1].str.replace(')', '')
In [32]:
store_locate.dropna(inplace = True)
In [33]:
store_locate
Out[33]:
Store Name Store Location Store Number County Number Category Vendor Number Pack Bottle Volume (ml) State Bottle Cost State Bottle Retail Bottles Sold Sale (Dollars) Volume Sold (Liters) Volume Sold (Gallons) Actual Sales(Dollar) Longitute Latitute
0 "Double ""D"" Liquor Store" 618 Rossville Road\nWaukon 52172\n(43.262069, ... 1071576 1476 5.150096e+08 126606 5818 496550.0 5009.73 7515.84 776 41130.36 783.96 204.39 11466.58 43.262069 -91.473622
1 "Double ""D"" Liquor Store" 618 Rossville Road\nWaukon 52172\n(43.262071, ... 3312738 4563 1.587373e+09 392704 18083 1541100.0 15682.67 23530.19 2475 137763.13 2541.24 663.28 37476.60 43.262071 -91.473622
2 1st Stop Beverage Shop 2839 East University Ave.\nDes Moines 50317\n(... 5426671 76769 1.047339e+09 258927 16651 757025.0 8465.81 12701.94 1277 133772.35 916.81 236.27 16007.64 41.600476 -93.561517
3 218 Fuel Express & Chubby's Liquor 68 Monroe St\nFloyd 50435\n(43.121605, -92.737... 3490200 23800 7.438568e+08 182410 8031 654575.0 7390.77 11088.44 1673 76316.51 1731.16 453.70 27641.92 43.121605 -92.737916
4 218 Fuel Express & Chubby's Liquor 68 Monroe St\nFloyd 50435\n(43.121613, -92.737... 872550 5950 1.828855e+08 47883 2038 163000.0 1783.90 2676.27 394 17283.43 370.21 96.89 5861.11 43.121613 -92.737905
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1936 Casey's General Store #3385 704 8th St SE\nOrange City 51041\n(42.997508, ... 367160 5628 6.976530e+07 17105 947 50050.0 585.92 879.05 289 2327.68 190.69 49.95 3550.04 42.997508 -96.049493
1937 Casey's General Store/ Toledo 113 Hwy 30 West\nToledo 52342\n(41.986402, -92... 111657 1806 2.173410e+07 5386 233 19725.0 192.58 289.29 52 1974.49 41.07 10.71 757.24 41.986402 -92.580381
1938 Kwik Stop 16th St #300 1215 E 16th St\nDubuque 52001\n(42.511657, -90... 189848 1178 3.939100e+07 8159 766 20050.0 269.97 405.06 274 1758.05 110.63 28.98 2264.01 42.511657 -90.656121
1939 Riverside Quick Shop 802, S Riverview\nBellevue 52031\n(42.252688, ... 85323 833 1.764710e+07 5056 306 15375.0 123.90 186.26 24 1998.47 22.61 5.88 321.69 42.252688 -90.420558
1940 Casey's General Store # 1591/ Decorah 401 College Dr.\nDecorah 52101\n(43.308961, -9... 310764 5472 5.913040e+07 14738 852 40575.0 491.43 737.29 249 1325.66 157.38 41.23 2957.72 43.308961 -91.79783

1820 rows × 17 columns

In [34]:
store_locate["Text"] = store_locate['Store Name'] + ":" + store_locate['Actual Sales(Dollar)'].astype(str)
fig = go.Figure(data=go.Scattergeo(
        locationmode = 'USA-states',
        lon = store_locate['Latitute'],
        lat = store_locate['Longitute'],
        text = store_locate['Text'],
        mode = 'markers', marker_color = store_locate['Actual Sales(Dollar)'],
        ))

fig.update_layout(
        title = 'Actual sales of each store in Iowa state',
    geo_scope = 'usa'
    )
fig.show()

Beginning of Radim's part

The following part is analysed by Radim Musalek

The most favourite bottle sizes

Liquor in the U.S. can be sold in various sizes and we'd expect the 750ml and 1000ml to be among the most popular ones. We decided to analyse the dataset to find out which bottle sizes were the most sold in Iowa over the given period of time.

In [35]:
bottle_volume_pop_full = df_dask.groupby("Bottle Volume (ml)").sum().reset_index()
In [36]:
fig_pop = px.bar(bottle_volume_pop_full.compute().sort_values("Bottle Volume (ml)"),
              x="Bottle Volume (ml)", y="Bottles Sold")
fig_pop.update_layout(title = "Favourite bottle sizes",
                  yaxis_title = "Bottles Sold",
                  xaxis =dict(title="Bottle Volume (ml)",
                    type="category"))

As we can see from the chart above the 750ml was the most sold bottle size as most stronger liquor is being sold in this volume. This was followed by 1750ml, often strong liquor for events is being sold in this size, then 375ml, i.e. half of 750ml size, and then also by the originally expected 1000ml. The sales volumes of the other sizes were mostly insignicant.

Two theories on the use of 750ml:

  • it said to be a suitable ration for a man with dinner. This was meant for wine consumption but it seems the liquor industry took over this standard.
  • this size of bottle was the largest that early glassblowers could produce with one full breath.

There was also one surprising bottle size, 31.5 litre bottles, which we decided to investigate further to find out what kind of liquor is being sold in such bottles.

In [37]:
bot_31500 = df_dask[df_dask["Bottle Volume (ml)"] == 31500]
In [38]:
fig_31500 = px.bar(bot_31500.compute(),
              x="Item Description", y="Bottles Sold", color="Item Description")
fig_31500.update_layout(title = "31.5 litre selection",
                  yaxis_title = "Bottles Sold")

There were 3 kinds of liquor sold in the 31.5litre bottles:

  • Seagrams whiskey
  • Smirnoff vodka
  • Captain Morgan rum

However, as we can see from the product descriptions the sold products were actually only packages of 18 1.75 litre bottles so nothing unusual.

Shops with the biggest and the smallest selection

There hundreds of shops in the Iowa state so we would like to find the one with the largest and the smallest selection. This could help us for example with recommendations for buyers where to find a broad selection and which shops are rather for specialised purchases.

In [41]:
selection = df_dask.compute().pivot_table(index="Category Name", columns="Store Name", values="State Bottle Retail", aggfunc=[lambda x: True if x.any() != np.nan else False])
selection.columns = selection.columns.get_level_values(1)
In [42]:
selection_tot = {}
for shop in selection.columns:
    sel = selection[shop].sum()
    selection_tot[shop] = sel
In [43]:
sel_pd = pd.DataFrame.from_dict(selection_tot, orient="index", columns=["Product selection"])
In [44]:
sel_top20 = sel_pd.nlargest(20, "Product selection")
sel_low20 = sel_pd.nsmallest(20, "Product selection")
In [45]:
fig_top20 = px.bar(sel_top20, y="Product selection", x=sel_top20.index)
fig_top20 = fig_top20.update_layout(title = "Top20 shops in selection variability",
                  yaxis_title = "Product groups in selection",
                  xaxis_title="Shop name")

fig_low20 = px.bar(sel_low20, y="Product selection", x=sel_low20.index)
fig_low20 = fig_low20.update_layout(title = "Lowest 20 shops in selection variability",
                  yaxis_title = "Product groups in selection",
                  xaxis_title="Shop name")
In [46]:
fig_top20.show()
In [47]:
fig_low20.show()

We'd expect higher differences in the largest shops offerings but since the products are grouped into 57 categories in our dataset the level of the most sold products per shop in the largest selection is pretty flat, ranging from 56 products for the top1 shop to 53 products for the top20 shop.

There're two groups in the top20 list:

  • Hy-Vee, a supermarket chain from Iowa covering 75% of the top20 list
  • various independent Wine and Spirit shops with the remaining 25% of spots in the top20

In the section with the lowest variation of products we can find mostly local distileries offerring their own products which would be surely in contrast with the Finnish market (if investigated) where distileries aren't allowed to sell directly to retail customers.

Beginning of Dmitry's part

This following part is analysedby Dmitry Buiskikh

In [77]:
from __future__ import print_function
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

from plotly.subplots import make_subplots
In [49]:
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct']
weekdays = ["Mon", "Tue", "Wed", "Thu", "Fri","Sat" ]
weeks = [str(i) for i in range(1,45)]
In [50]:
df_dask["Date"].dtype
Out[50]:
dtype('O')
In [51]:
df_dask["Date"] = dd.to_datetime(df_dask["Date"],dayfirst=False)
df_dask["Date"].dtype
Out[51]:
dtype('<M8[ns]')
In [52]:
dfSorted = df_dask.compute().sort_values(by="Date").reset_index()
In [53]:
type(dfSorted)
Out[53]:
pandas.core.frame.DataFrame
In [54]:
dfSorted.head()
Out[54]:
index Invoice/Item Number Date Store Number Store Name Address City Zip Code Store Location County Number ... Item Description Pack Bottle Volume (ml) State Bottle Cost State Bottle Retail Bottles Sold Sale (Dollars) Volume Sold (Liters) Volume Sold (Gallons) Actual Sales(Dollar)
0 7468 INV-02215400001 2016-12-08 2848 CVS Pharmacy #10114 / Ankeny 104 E First St Ankeny 50023 104 E First St\nAnkeny 50023\n(41.7319, -93.59... 77 ... Jose Cuervo Authentic Light Margarita Lime 6 1750.0 8.20 12.30 1 73.80 1.75 0.46 12.30
1 3151 INV-02201900001 2016-12-14 4745 Walgreens #05042 / Cedar Rapids 3325 16th Ave SW Cedar Rapids 52404 3325 16th Ave SW\nCedar Rapids 52404\n(41.9634... 57 ... Black Velvet 6 1750.0 10.45 15.68 1 0.00 1.75 0.46 15.68
2 3152 INV-02201900002 2016-12-14 4745 Walgreens #05042 / Cedar Rapids 3325 16th Ave SW Cedar Rapids 52404 3325 16th Ave SW\nCedar Rapids 52404\n(41.9634... 57 ... Jose Cuervo Authentic Strawberry Lime Margarit... 6 800.0 3.88 5.82 3 0.00 2.40 0.63 17.46
3 39959 INV-02301400001 2016-12-14 4888 Circle S Bluff Stop 302 1st St Sergeant Bluff 51054 302 1st St\nSergeant Bluff 51054\n(42.400333, ... 97 ... 99 Cinnamon Mini 10 600.0 6.00 9.00 1 0.00 0.60 0.15 9.00
4 0 INV-02190200029 2016-12-15 2593 Hy-Vee Food Store / Carroll 905 US Highway 30 West Carroll 51401 905 US Highway 30 West\nCarroll 51401\n(42.070... 14 ... Kessler Blend Whiskey 6 1750.0 11.01 16.52 2 99.12 3.50 0.92 33.04

5 rows × 26 columns

In [55]:
DFD = dfSorted.set_index("Date")

Setting Datetime MultiIndex

In [56]:
DFD_MI = dfSorted.set_index([DFD.index.year, "Date"]) 
In [57]:
DFD_MI.head()
Out[57]:
index Invoice/Item Number Store Number Store Name Address City Zip Code Store Location County Number County ... Item Description Pack Bottle Volume (ml) State Bottle Cost State Bottle Retail Bottles Sold Sale (Dollars) Volume Sold (Liters) Volume Sold (Gallons) Actual Sales(Dollar)
Date Date
2016 2016-12-08 7468 INV-02215400001 2848 CVS Pharmacy #10114 / Ankeny 104 E First St Ankeny 50023 104 E First St\nAnkeny 50023\n(41.7319, -93.59... 77 Polk ... Jose Cuervo Authentic Light Margarita Lime 6 1750.0 8.20 12.30 1 73.80 1.75 0.46 12.30
2016-12-14 3151 INV-02201900001 4745 Walgreens #05042 / Cedar Rapids 3325 16th Ave SW Cedar Rapids 52404 3325 16th Ave SW\nCedar Rapids 52404\n(41.9634... 57 LINN ... Black Velvet 6 1750.0 10.45 15.68 1 0.00 1.75 0.46 15.68
2016-12-14 3152 INV-02201900002 4745 Walgreens #05042 / Cedar Rapids 3325 16th Ave SW Cedar Rapids 52404 3325 16th Ave SW\nCedar Rapids 52404\n(41.9634... 57 LINN ... Jose Cuervo Authentic Strawberry Lime Margarit... 6 800.0 3.88 5.82 3 0.00 2.40 0.63 17.46
2016-12-14 39959 INV-02301400001 4888 Circle S Bluff Stop 302 1st St Sergeant Bluff 51054 302 1st St\nSergeant Bluff 51054\n(42.400333, ... 97 WOODBURY ... 99 Cinnamon Mini 10 600.0 6.00 9.00 1 0.00 0.60 0.15 9.00
2016-12-15 0 INV-02190200029 2593 Hy-Vee Food Store / Carroll 905 US Highway 30 West Carroll 51401 905 US Highway 30 West\nCarroll 51401\n(42.070... 14 CARROLL ... Kessler Blend Whiskey 6 1750.0 11.01 16.52 2 99.12 3.50 0.92 33.04

5 rows × 25 columns

The DataSet starts from the middle of December 2016. In order to make the analyze more descriptive, we decided to concentrate only on the year of 2017. In fact, on 10 months of 2017, as the DataSet ends on 2017-10-31.

In [58]:
DFD2017 = DFD_MI.loc[2017]
In [59]:
type(DFD2017)
Out[59]:
pandas.core.frame.DataFrame
In [60]:
DFD2017.shape
Out[60]:
(1875820, 25)

10 months of 2017 have 1.87 mln of rows. Enough for the analyze to be reliable.

In [61]:
DFD2017.tail()
Out[61]:
index Invoice/Item Number Store Number Store Name Address City Zip Code Store Location County Number County ... Item Description Pack Bottle Volume (ml) State Bottle Cost State Bottle Retail Bottles Sold Sale (Dollars) Volume Sold (Liters) Volume Sold (Gallons) Actual Sales(Dollar)
Date
2017-10-31 136043 INV-08362000017 5113 Ray's Supermarket, Inc. 1975 Franklin St Waterloo 50703 1975 Franklin St\nWaterloo 50703\n(42.495394, ... 7 BLACK HAWK ... Five Star 12 750.0 3.50 5.25 1 63.00 0.75 0.19 5.25
2017-10-31 136044 INV-08362000018 5113 Ray's Supermarket, Inc. 1975 Franklin St Waterloo 50703 1975 Franklin St\nWaterloo 50703\n(42.495394, ... 7 BLACK HAWK ... New Amsterdam Vodka Mini 10 500.0 5.54 8.31 1 83.10 0.50 0.13 8.31
2017-10-31 130788 INV-08346800012 5460 Smokin Joe's Tobacco & Liquor #6 2411 2nd St Ste 4 Coralville 52241 2411 2nd St Ste 4\nCoralville 52241\n(41.68361... 52 JOHNSON ... Remy Martin Vsop Cognac 24 200.0 6.16 9.24 6 9.24 1.20 0.31 55.44
2017-10-31 130786 INV-08346800010 5460 Smokin Joe's Tobacco & Liquor #6 2411 2nd St Ste 4 Coralville 52241 2411 2nd St Ste 4\nCoralville 52241\n(41.68361... 52 JOHNSON ... New Amsterdam Gin 24 375.0 3.50 5.25 4 5.25 1.50 0.39 21.00
2017-10-31 132596 INV-08351800029 2663 Hy-Vee / Urbandale 8701 Douglas Urbandale 50322 8701 Douglas\nUrbandale 50322\n(41.629364, -93... 77 POLK ... Ice Hole Butterscotch Schnapps 12 750.0 5.15 7.73 1 92.76 0.75 0.19 7.73

5 rows × 25 columns

Sales statistics

Monthly sales statistics

In [62]:
DFD2017Month = DFD2017.resample("M").sum()
In [63]:
DFD2017Month.head(2)
Out[63]:
index Store Number County Number Category Vendor Number Pack Bottle Volume (ml) State Bottle Cost State Bottle Retail Bottles Sold Sale (Dollars) Volume Sold (Liters) Volume Sold (Gallons) Actual Sales(Dollar)
Date
2017-01-31 21598357024 634314642 9734970 1.788269e+11 45013108 2140966 155846100.0 1699679.76 2551317.13 364834 1.314165e+07 305523.61 79760.1 5080743.93
2017-02-28 22083847428 597561862 9262258 1.672019e+11 41933316 2006070 146665450.0 1578004.45 2367546.03 367640 1.265068e+07 331334.72 86646.4 5218200.56
In [64]:
fig = px.line(x=months, y=DFD2017Month["Actual Sales(Dollar)"])
    
fig.update_traces(mode='lines+markers')
    
fig.update_layout(
          title='Total monthly sales, $',
                  yaxis_title='$',
                  yaxis_range =[0, 8000000],
                  xaxis_title=None,
                 

                  
                  showlegend = False,
                 autosize=True, height = 500,
                                width = 1000  
                 )
fig.show()

Weekly sales statistics

In [65]:
DFD2017Week = DFD2017.resample("W").sum()
In [66]:
fig = px.line(x=weeks[:-1], y=DFD2017Week["Actual Sales(Dollar)"][:-1])
    
fig.update_traces(mode='lines+markers')
    
fig.update_layout(
          title='Total weekly sales, $',
                  yaxis_title='$',
                  yaxis_range =[0, 2000000],
                  xaxis_title=None,
                  
                  
                  showlegend = False,
                 autosize=True, height = 500,
                                width = 1000  
                 )
fig.show()

Statistic on days of week

In [67]:
DFD2017["weekday"] = DFD2017.index.day_name().str[:3]
In [68]:
DFD2017["weekday"] = DFD2017.index.day_name().str[:3]
In [69]:
paivat = []
for day in weekdays[:-1]:

  paivat.append(int(DFD2017[DFD2017["weekday"]==day]["Actual Sales(Dollar)"].sum().round(0)))
In [70]:
fig = px.line(x=weekdays[:-1], y=paivat)
    
fig.update_traces(mode='lines+markers')
    
fig.update_layout(
          title='Sales by day of week, $',
                  yaxis_title='$',
                  yaxis_range =[0, 16000000],
                  xaxis_title=None,
                  
                  
                  showlegend = False,
                 autosize=True, height = 500,
                                width = 700  
                 )
fig.show()

Statistics based on Category of the beverages

In [71]:
def pivotTable(kolonna, parameter):
  A = pd.pivot_table(DFD2017, index = "Category Name", 
                       columns = kolonna,
                       values = [parameter],
                       aggfunc = np.sum)
  return A
In [72]:
def plotting(period, A, B, C, choice, podpis):

  fig = go.Figure()

  specs = [[{'type':'xy'}], [{'type':'xy'}], [{'type':'xy'}]]
  fig = make_subplots(rows=3, cols=1, specs=specs,
        subplot_titles=
        ("Bottles Sold","Volume Sold", "Sales")
                      )

  fig.add_trace(go.Bar(
      x=period,
      y=A.loc[choice],
    
      name=choice + " (Bottles Sold)",
      marker_color="red"
      ),1,1)

  fig.add_trace(go.Bar(
      x=period,
      y=B.loc[choice],

      name= choice + " (Volume Sold (Liters))",
      marker_color="blue"
      ),2,1)

  fig.add_trace(go.Bar(
      x=period,
      y=C.loc[choice],

      marker_color="green"
      ),3,1)
      
  fig.update_yaxes(title_text="Bottles", row=1, col=1)
  fig.update_yaxes(title_text="Liters", row=2, col=1)
  fig.update_yaxes(title_text="$", row=3, col=1)

  fig.update_layout(height=800, width=800, 
                    showlegend=False, 
                    title={
                  'text': choice + " ("+ podpis +")",
                  'y':0.98,
                  'x':0.5},)

  fig.show()
In [ ]:
def byBeverage(choice):
  
  A = pivotTable(DFD2017.index.weekday, "Bottles Sold")
  B = pivotTable(DFD2017.index.weekday, "Volume Sold (Liters)")
  C = pivotTable(DFD2017.index.weekday, "Actual Sales(Dollar)")  
  plotting(weekdays, A,B,C, choice, "Days of week")


  A = pivotTable(DFD2017.index.week, "Bottles Sold")
  B = pivotTable(DFD2017.index.week, "Volume Sold (Liters)")
  C = pivotTable(DFD2017.index.week, "Actual Sales(Dollar)")  
  plotting(weeks, A,B,C, choice, "Weekly")

  A = pivotTable(DFD2017.index.month, "Bottles Sold")
  B = pivotTable(DFD2017.index.month, "Volume Sold (Liters)")
  C = pivotTable(DFD2017.index.month, "Actual Sales(Dollar)")  
  plotting(months, A,B,C, choice, "Monthly")
In [ ]:
interact(byBeverage,choice=DFD2017["Category Name"])
In [ ]:
DFD2017["Category Name"].unique()
In [78]:
byBeverage("Gold Rum")

This following part is analysed by Biswajit Paul. It consists:

  • Sales per store(most/least)
  • Sales per category(most/least)

Sample dataset was created from original dataset and both sample dataset and original dataset analysed.

Sample data

In [79]:
sample = df_dask.sample(frac = 0.002)
In [80]:
sample['Store Name'].compute().nunique()
Out[80]:
982
In [81]:
sample['Category Name'].compute().nunique()
Out[81]:
53

Sample dataset is among 997 shops and 50 alco items

Top5 seller from sample dataset

In [82]:
a=total_maximum_sales_per_store=sample.groupby(['Store Name'])['Actual Sales(Dollar)'].sum().nlargest(5)
In [83]:
fig = go.Figure()
b=a.index.compute()
y=a.compute()

def SetColor(y):
        if(y >= 3000):
            return "lime"
        elif(y >= 1500):
            return "purple"
        elif(y >= 1000):
            return "coral"

fig.add_trace((go.Bar(x=b,y=a,marker=dict(color = list(map(SetColor, y))))))
fig.update_layout(
    title={
        'text': "top      5      seller"
        ,
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})

fig.show()

Flop 5 seller from sample dataset

In [84]:
c=total_minimum_sales_per_store=sample.groupby(['Store Name'])['Actual Sales(Dollar)'].sum().nsmallest(5)
In [85]:
import plotly.graph_objects as go

fig = go.Figure()
d=c.index.compute()
y=c.compute()

def SetColor(y):
        if(y <= 2):
            return "red"
        elif(y <= 3):
            return "blue"
        elif(y <= 5):
            return "green"

fig = go.Figure((go.Bar(x=d, y=c,marker=dict(color = list(map(SetColor, y))))))
fig.update_layout(
    title={
        'text': "flop    5    seller",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})

fig.show()

Top 5 best selling alco from sample dataset

In [86]:
e=total_maximum_sales_per_category=sample.groupby(['Category Name'])['Bottles Sold'].sum().nlargest(5)
In [87]:
import plotly.graph_objects as go

fig = go.Figure()
f=e.index.compute()
y=e.compute()

def SetColor(y):
        if(y >= 1000):
            return "lime"
        elif(y >= 600):
            return "purple"
        elif(y >= 400):
            return "coral"

fig.add_trace((go.Bar(x=f,y=e,
                      marker=dict(color = list(map(SetColor, y))))))
fig.update_layout(
    title={
        'text': "top     5   best  selling  alco"
        ,
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})

fig.show()

5 least selling alco sample dataset

In [88]:
g=total_minimum_sales_per_category=sample.groupby(['Category Name'])['Bottles Sold'].sum().nsmallest(5)
In [89]:
import plotly.graph_objects as go

fig = go.Figure()
h=g.index.compute()
y=g.compute()

def SetColor(y):
        if(y <= 1):
            return "red"
        elif(y <= 2):
            return "blue"
        elif(y <= 5):
            return "green"

fig = go.Figure((go.Bar(x=h, y=g,marker=dict(color = list(map(SetColor, y))))))
fig.update_layout(
    title={
        'text': "flop    5    least selling alco",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})

fig.show()

next step is to implement it into the whole dataset

In [90]:
df_dask['Store Name'].compute().nunique()
Out[90]:
1490
In [91]:
df_dask.groupby(['Store Name'])['Actual Sales(Dollar)'].sum().nlargest(20).compute()
Out[91]:
Store Name
Hy-Vee #3 / BDI / Des Moines             956531.94
Central City 2                           914918.86
Hy-Vee Wine and Spirits / Iowa City      536011.51
Hy-Vee Food Store / Cedar Falls          497918.77
Hy-Vee #4 / Wdm                          425773.47
Hy-Vee Wine and Spirits / WDM            406651.77
Costco Wholesale #788                    404129.04
Sam's Club 8162 / Cedar Rapids           389006.09
Hy-Vee Wine and Spirits / Bettendorf     379934.70
Hy-Vee Food Store #1 / Mason City        371038.35
Hy-Vee #3 Food and Drugstore             366481.71
Hy-Vee Food Store / Coralville           365631.74
Cyclone Liquors                          364659.62
Hy-Vee / Waukee                          363310.84
Hy-Vee Food Store #2 / Council Bluffs    360798.13
Hy-Vee Wine and Spirits #2               357347.15
Central City Liquor, Inc.                356517.46
Hy-Vee Food Store / Marion               351038.48
Hy-Vee Food Store #1 / Council Bluffs    346575.75
Sam's Club 6344 / Windsor Heights        344182.31
Name: Actual Sales(Dollar), dtype: float64
In [92]:
df_dask.groupby(['Store Name'])['Actual Sales(Dollar)'].sum().nsmallest(20).compute()
Out[92]:
Store Name
Iowa Legendary Rye                         37.50
Thome Enterprises LLC                      48.00
Swell Liquor                              170.70
Casey's General Store # 1331/ Sac City    265.64
Lonely Oak Distilery                      273.00
That Iowa Girl                            326.29
K & S Grocery                             348.01
Crossroads of Hampton                     375.80
Story City Market                         400.82
Y-Knot                                    414.52
Crossroads of Sac City                    433.25
VBJ Beverages LLC                         487.50
Dehner Distillery                         523.43
Broadbent Distillery                      575.68
Ackley Superfoods                         602.26
Hawkeye Convenience Stores / 16th Av      634.82
Randhawa's Travel Center                  683.09
Fas Mart # 5159/ Dubuque                  698.29
Charley's Quik Shop                       705.42
Riverside Quick Shop                      756.80
Name: Actual Sales(Dollar), dtype: float64

Total shops 1490

Top seller: Hy-Vee 3 / BDI / Des Moines 956K $

Least seller: Iowa Legendary Rye 37.5 $

Top5 seller

In [93]:
i=total_maximum_sales_per_store=df_dask.groupby(['Store Name'])['Actual Sales(Dollar)'].sum().nlargest(5)
In [94]:
import plotly.graph_objects as go

fig = go.Figure()
j=i.index.compute()
y=i.compute()

def SetColor(y):
        if(y >= 800000):
            return "lime"
        elif(y >= 4000000):
            return "purple"
        elif(y >= 200000):
            return "coral"

fig.add_trace((go.Bar(x=j,y=i,marker=dict(color = list(map(SetColor, y))))))
fig.update_layout(
    title={
        'text': "top      5      seller"
        ,
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})

fig.show()

Flop 5 seller

In [95]:
k=total_minimum_sales_per_store=df_dask.groupby(['Store Name'])['Actual Sales(Dollar)'].sum().nsmallest(5)
In [96]:
import plotly.graph_objects as go

fig = go.Figure()
l=k.index.compute()
y=k.compute()

def SetColor(y):
        if(y >= 250):
            return "green"
        elif(y >= 150):
            return "blue"
        elif(y >= 20):
            return "red"

fig = go.Figure((go.Bar(x=l, y=k,marker=dict(color = list(map(SetColor, y))))))
fig.update_layout(
    title={
        'text': "flop    5    seller",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})

fig.show()
In [97]:
df_dask['Category Name'].compute().nunique()
Out[97]:
57
In [98]:
df_dask.groupby(['Category Name'])['Bottles Sold'].sum().nlargest(20).compute()
Out[98]:
Category Name
American Vodkas                 551157
Canadian Whiskies               501948
Spiced Rum                      257647
Straight Bourbon Whiskies       247893
Whiskey Liqueur                 234867
American Flavored Vodka         226071
Blended Whiskies                165978
American Brandies               161147
Imported Vodkas                 143576
American Vodka                  142032
American Schnapps               137733
Tennessee Whiskies              137261
Mixto Tequila                   111577
Flavored Rum                    109578
White Rum                       107817
100% Agave Tequila              105906
Imported Cordials & Liqueurs    102291
Imported Brandies                93838
Imported Flavored Vodka          91751
American Dry Gins                87102
Name: Bottles Sold, dtype: int64
In [99]:
df_dask.groupby(['Category Name'])['Bottles Sold'].sum().nsmallest(20).compute()
Out[99]:
Category Name
American Gins                               5
Temporary &  Specialty Packages             9
Imported Distilled Spirit Specialty       287
Delisted Items                            420
Imported Cordials & Liqueur               428
Iowa Distillery Whiskies                  618
American Distilled Spirits Specialty     1069
Iowa Distilleries                        1272
Mezcal                                   1304
American Sloe Gins                       1424
Bottled in Bond Bourbon                  2039
Holiday VAP                              2597
Corn Whiskies                            3378
Flavored Gin                             3692
American Distilled Spirit Specialty      4260
Single Barrel Bourbon Whiskies           4331
Neutral Grain Spirits Flavored           4531
Neutral Grain Spirits                    7529
Aged Dark Rum                            9861
Imported Distilled Spirits Specialty    13292
Name: Bottles Sold, dtype: int64

Among 57 alco items

Top 5 best selling alco

In [100]:
m=total_maximum_sales_per_category=df_dask.groupby(['Category Name'])['Bottles Sold'].sum().nlargest(5)
In [101]:
import plotly.graph_objects as go

fig = go.Figure()
n=m.index.compute()
y=m.compute()

def SetColor(y):
        if(y >= 500000):
            return "lime"
        elif(y >= 200000):
            return "purple"
        elif(y >= 400):
            return "coral"

fig.add_trace((go.Bar(x=n,y=m,
                      marker=dict(color = list(map(SetColor, y))))))
fig.update_layout(
    title={
        'text': "top     5   best  selling  alco"
        ,
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})

fig.show()

5 least selling alco

In [102]:
o=total_minimum_sales_per_category=df_dask.groupby(['Category Name'])['Bottles Sold'].sum().nsmallest(5)
In [103]:
import plotly.graph_objects as go

fig = go.Figure()
p=o.index.compute()
y=o.compute()

def SetColor(y):
        if(y <= 50):
            return "red"
        elif(y <= 300):
            return "blue"
        elif(y <= 450):
            return "green"

fig = go.Figure((go.Bar(x=p, y=o,marker=dict(color = list(map(SetColor, y))))))
fig.update_layout(
    title={
        'text': "flop    5    least selling alco",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})

fig.show()